Intro to ggplot

Author

James Van Slyke

Using `ggplot`

ggplot is a graphics language that we use to make graphs. The text we will use for our first go at making graphs is called R for Data Science

Like a lot of other attractive things about R and R Studio the book is free!! and probably one of the best resources for understanding R and Data.

You can find the book here

I’ll be using several modified examples from Chapter 3 of that book.

The Tidyverse

Tidyverse is a package we’ll be using throughout our exploration of statistics. There are lots of great helps for doing data science.

First you’ll want to load “tidyverse” as a package. Here’s how we find and load packages in R Studio.

Search under “Packages” in the bottom right window
Click on install and search for tidyverse in the CRAN repository
Once you’ve found tidyverse, click on install - It may take some time for the package to download and install - wait until it is all done.
Once the package is downloaded just make sure to check next to the package on your list of packages OR simply use this code.
```
library(tidyverse)
```

`mpg` dataset

We’ll start off by using a dataset called “mpg” It should have loaded with the tidyverse package. Just type it in to find it

mpg

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

It shows the dataset as a “tibble”, which just means table in the tidyverse

Because it’s a preloaded dataset we can get information about it by using a question mark. A question mark in front of any command in R will automatically generate the help file

?mpg

Levels of Measurement

Categorical
- Binary variable
- Nominal variable
- Ordinal variable
Continuous
- Interval variable
- Ratio variable

Datasets contain several different variables. Notice in the mpg dataset, there are several different variables such as model, year, cty, and hwy. The str command shows all the different variables.

str(mpg)

tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
 $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
 $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
 $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr [1:234] "f" "f" "f" "f" ...
 $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr [1:234] "p" "p" "p" "p" ...
 $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

These are several different types of variables because they represent different kinds of things. In statistics we refer to this as levels of measurement. There are two broad types of variables, categorical and continuous (Field, 2009, pages 8-9). Categorical variables are based on particular categories, such a type of shoe, religious affiliation, or political affiliation. A binary variable is a type of categorical variable that only takes two categories. So things like being pregnant or not, voting yes or no on a certain bill, or being alive or dead are binary variables, there are only two possible things in the category. Categorical variables that take on more than 2 possibilities are called nominal variables (nominal means names)

Categorical

There’s no mathematics involved in determining categorical variables, it’s simply based on which category has the most accurate fit. Either you are a republican, democrat, or independent, there is no mathematical quantity that would determine the appropriate category. There is one type of categorical variable that is ordered based on the absence or presence of a particular property, which is an ordinal variable. Ordinal variables have a particular rank order, but the rankings are not equal or uniform. So for example, college basketball team rankings would be an ordinal variable. The teams are ranked from best to worst, but team #2 may be twice is good as team #3, while team #5 maybe four times as good as team #6. So ordinal data tells us more about the variable than nominal data, but it still doesn’t provide a standardized scale of measurement.

Continuous

Continuous variables are the second broad category and involves some type of numerical measurement to define the property in the variable. Continuous variables can take on any value in the measurement scale defined by the variable. The first example of this type variable is an interval variable. These types of variables are based on a measurement scale with equal distances between the ranks based on the property measured; equal intervals in the scale are able to represent equal differences in the variable being measured. The Fahrenheit scale would be an interval scale because the difference in degree between 64 and 65 is the same as 74 and 75.

Ratio

Ratio variables add one more dimension to the properties of an interval variable. In addition to having an equal distance in ranks, ratios have an absolute zero point. This allows for multiplication of the intervals or the use of ratios. So on a ratio scale of 0 to 5, a score of 4 would be twice as good as a score of 2. Time is a good example of a ratio scale. There is an absolute zero point (there is no such thing as negative time), the distance between 40 and 60 seconds is the same as the distance between 30 and 10 seconds. And 20 seconds is twice as long as 10 seconds. Another example, Fahrenheit is an interval scale because you can have negative degrees (e. g. -2 degrees below zero), while the Kelvin scale is a ratio scale because there is an absolute zero point and no negative numbers.

One other property of continuous variables is that they allow for different degrees of measurement precision. So for example, the continuous ratio variable of time can be measured in hours, minutes, seconds, milliseconds, etc. In contrast, a discrete variable can only take on a fixed measurement scale, like a rank scale of 1 to 10. The scale requires that you choose a value between 1 and 10, 2.5 is not possible.

Scatterplot

Let’s start by creating a simple scatterplot graph

ggplot(data = mpg) +
  geom_point(mapping = aes(x= displ, y=hwy))

There’s a few things to learn here in the code

ggplot is the basic command and within the parentheses is the data we’ll be running our graph on
geom_point is the type of graph will be using, geom stands for geometrical shape. So in this case we are making a graph of “points”
The “mapping” argument lays out the variables we are graphing and is always paired with “aes”. Finally x and y lay out which variables are going to be on our x and y axes.

So our basic formula looks lke this:

ggplot(data = ) + (mapping = aes())

About the variables

If you look at the variables, you’ll notice that hwy and displ are numbers and R views them as integers or whole numbers. Scatterplots usually require data that are numeric like this, numbers that go up in scale. So in this case scatterplots typically require interval or ratio data.

One peculiar thing about this graph is the group of numbers just above the numbers in the right corner. The general trend we see in these variables is that as displacement goes up (i.e. the engine uses more gasoline) the highway mileage goes down. This is sometimes referred to as a negative relationship. The cars with the best highway mileage displace the least amount of gasoline. However, there’s a group of dots that range between 20 and 30 on the hwy variable, but seem to displace more gasoline than most. How can this be?

Adding additional variables

One way we can answer this question is by adding in an additional variable. In this case the class or type of car. Let’s add that in now as an additional variable to our graph.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Notice how most of those dots are 2 seater cars. Thus, the reason they get better gas mileage is because they are smaller cars!!

Class is a particular kind of variable, in this case a nominal or categorical variable. r refers to them as characters. These are basically categories, so we can’t do any math on them. You can’t make a formula out of compact x midsize or minivan/pickup.

Add Color

Let’s do one more graph, but add a little color

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Notice how when the word “color” is outside the x and y mappings it changes the color of our points. You can try several different colors for the dots on your scatterplot.

Bar Graphs

Let’s learn one more graph, which is especially important for categorical variables, the bar graph.

First, check out the diamonds data set

diamonds

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

The basics of the formula are going to be the same, but this time we are using geom_bar as our geometrical shape to create a bar graph.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

If you look in section 3.7 in R for Data Science it explains what R is doing with the data

So R looks at the data set and automatically uses the count statistic to simply count the number of occurrences for that variable. So this only works for categorical variables.

Let’s inspect the cut variable to see if this is true. Remember that the str command allows us to investigate the types of variables in the dataset.

str(diamonds)

tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Each of the variables are preceded by the $ sign. So we can see that cut is the second variable. Notice that it says cut is an ord.factor with 5 levels. This tells us that this variable is an ordinal variable because it is ordered based on the type of cut from fair (lowest type of cut) to ideal (the best type of cut).

Another example

Here’s another example from scratch. First I’ll create a quick dataset using the tribble command.

Example <- tribble(
  ~group, ~number, 
  "Group 1", 30, 
  "Group 2", 50
)

Notice that our variable group is a categorical variable, but on it’s own the stat count won’t really tell us much (Group 1 = 1, Group 2 =1). So notice what happens when we try to make a bar graph like our last example.

ggplot(data = Example) + 
  geom_bar(mapping = aes(x = group))

Doesn’t really tell us much does it? So in this case the count has to be supplied by a second variable, y. When we use a y variable for our bar chart we have to use a different stat for the bar chart, in this case, the stat is “identity” so the code looks like this.

ggplot(data = Example) +
  geom_bar(mapping = aes(x = group, y = number), stat = "identity")

Now it’s easy to see the difference in count between Group 1 and Group 2

Using ggplot